A currated list of useful tools for data analysis.

  • pandas_profile
  • pyviz
  • resumetable
  • feature_transform (my library)

I will explore the transformations using the wine data set from Kaggle.

from pathlib import Path


import pandas as pd
#import numpy as np
#from scipy.stats import kurtosis, skew
from scipy import stats
# import math
# import warnings
# warnings.filterwarnings("error")

from google.colab import drive
mnt=drive.mount('/content/gdrive', force_remount=True)


root_dir = "/content/gdrive/My Drive/"
base_dir = root_dir + 'redwine'
csv_path = (base_dir+'/winequality-red.csv')
df=pd.read_csv(csv_path)
Mounted at /content/gdrive
# https://gist.github.com/harperfu6/5ea565ee23aaf8461a840c480490cd9a

pd.set_option("display.max_rows", 1000)
def resumetable(df):
    print(f'Dataset Shape: {df.shape}')
    summary = pd.DataFrame(df.dtypes, columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name', 'dtypes']]
    summary['Missing'] = df.isnull().sum().values
    summary['Uniques'] = df.nunique().values
    summary['First Value'] = df.loc[0].values
    summary['Second Value'] = df.loc[1].values
    summary['Third Value'] = df.loc[2].values
    
    for name in summary['Name'].value_counts().index:
        summary.loc[summary['Name'] == name, 'Entropy'] = \
        round(stats.entropy(df[name].value_counts(normalize=True), base=2), 2)
    
    return summary

Typically, the first thing to do is examine the first rows of data, but this just gives you a very rudimentary feel for the data.

df = pd.read_csv(csv_path) 
df.head()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5

I found resumetable() to be very convenient. We get a sense of cardinaltiy from Uniques and we can easily see where we are missing data.

Also, knowing the datatypes of each column is helpful when in comes to pre-processing the data.

I came across this funtion on Kaggle(I think) and found it incredibly helpful

resumetable(df)
Dataset Shape: (1599, 12)
Name dtypes Missing Uniques First Value Second Value Third Value Entropy
0 fixed acidity float64 0 96 7.4000 7.8000 7.800 5.94
1 volatile acidity float64 0 143 0.7000 0.8800 0.760 6.39
2 citric acid float64 0 80 0.0000 0.0000 0.040 5.87
3 residual sugar float64 0 91 1.9000 2.6000 2.300 4.78
4 chlorides float64 0 153 0.0760 0.0980 0.092 6.22
5 free sulfur dioxide float64 0 60 11.0000 25.0000 15.000 5.08
6 total sulfur dioxide float64 0 144 34.0000 67.0000 54.000 6.60
7 density float64 0 436 0.9978 0.9968 0.997 7.96
8 pH float64 0 89 3.5100 3.2000 3.260 5.91
9 sulphates float64 0 96 0.5600 0.6800 0.650 5.73
10 alcohol float64 0 65 9.4000 9.8000 9.800 5.19
11 quality int64 0 6 5.0000 5.0000 5.000 1.71

Another tool I use is pandas_profiling

import sys

!"{sys.executable}" -m pip install -U pandas-profiling[notebook]
!jupyter nbextension enable --py widgetsnbextension

Collecting pandas-profiling[notebook]
  Downloading https://files.pythonhosted.org/packages/dd/12/e2870750c5320116efe7bebd4ae1709cd7e35e3bc23ac8039864b05b9497/pandas_profiling-2.11.0-py2.py3-none-any.whl (243kB)
     |████████████████████████████████| 245kB 8.7MB/s 
Collecting phik>=0.10.0
  Downloading https://files.pythonhosted.org/packages/d9/27/d4197ed93c26d9eeedb7c73c0f24462a65c617807c3140e012950c35ccf9/phik-0.11.0.tar.gz (594kB)
     |████████████████████████████████| 604kB 8.7MB/s 
Requirement already satisfied, skipping upgrade: ipywidgets>=7.5.1 in /usr/local/lib/python3.7/dist-packages (from pandas-profiling[notebook]) (7.6.3)
Requirement already satisfied, skipping upgrade: missingno>=0.4.2 in /usr/local/lib/python3.7/dist-packages (from pandas-profiling[notebook]) (0.4.2)
Collecting requests>=2.24.0
  Downloading https://files.pythonhosted.org/packages/29/c1/24814557f1d22c56d50280771a17307e6bf87b70727d975fd6b2ce6b014a/requests-2.25.1-py2.py3-none-any.whl (61kB)
     |████████████████████████████████| 61kB 8.3MB/s 
Requirement already satisfied, skipping upgrade: seaborn>=0.10.1 in /usr/local/lib/python3.7/dist-packages (from pandas-profiling[notebook]) (0.11.1)
Requirement already satisfied, skipping upgrade: scipy>=1.4.1 in /usr/local/lib/python3.7/dist-packages (from pandas-profiling[notebook]) (1.4.1)
Collecting tqdm>=4.48.2
  Downloading https://files.pythonhosted.org/packages/d9/13/f3f815bb73804a8af9cfbb6f084821c037109108885f46131045e8cf044e/tqdm-4.57.0-py2.py3-none-any.whl (72kB)
     |████████████████████████████████| 81kB 8.6MB/s 
Collecting htmlmin>=0.1.12
  Downloading https://files.pythonhosted.org/packages/b3/e7/fcd59e12169de19f0131ff2812077f964c6b960e7c09804d30a7bf2ab461/htmlmin-0.1.12.tar.gz
Collecting confuse>=1.0.0
  Downloading https://files.pythonhosted.org/packages/6d/55/b4726d81e5d6509fa3441f770f8a9524612627dc1b2a7d6209d1d20083fe/confuse-1.4.0-py2.py3-none-any.whl
Requirement already satisfied, skipping upgrade: numpy>=1.16.0 in /usr/local/lib/python3.7/dist-packages (from pandas-profiling[notebook]) (1.19.5)
Collecting tangled-up-in-unicode>=0.0.6
  Downloading https://files.pythonhosted.org/packages/4a/e2/e588ab9298d4989ce7fdb2b97d18aac878d99dbdc379a4476a09d9271b68/tangled_up_in_unicode-0.0.6-py3-none-any.whl (3.1MB)
     |████████████████████████████████| 3.1MB 22.5MB/s 
Requirement already satisfied, skipping upgrade: matplotlib>=3.2.0 in /usr/local/lib/python3.7/dist-packages (from pandas-profiling[notebook]) (3.2.2)
Requirement already satisfied, skipping upgrade: jinja2>=2.11.1 in /usr/local/lib/python3.7/dist-packages (from pandas-profiling[notebook]) (2.11.3)
Requirement already satisfied, skipping upgrade: pandas!=1.0.0,!=1.0.1,!=1.0.2,!=1.1.0,>=0.25.3 in /usr/local/lib/python3.7/dist-packages (from pandas-profiling[notebook]) (1.1.5)
Requirement already satisfied, skipping upgrade: attrs>=19.3.0 in /usr/local/lib/python3.7/dist-packages (from pandas-profiling[notebook]) (20.3.0)
Collecting visions[type_image_path]==0.6.0
  Downloading https://files.pythonhosted.org/packages/98/30/b1e70bc55962239c4c3c9660e892be2d8247a882135a3035c10ff7f02cde/visions-0.6.0-py3-none-any.whl (75kB)
     |████████████████████████████████| 81kB 9.7MB/s 
Requirement already satisfied, skipping upgrade: joblib in /usr/local/lib/python3.7/dist-packages (from pandas-profiling[notebook]) (1.0.1)
Collecting jupyter-client>=6.0.0; extra == "notebook"
  Downloading https://files.pythonhosted.org/packages/83/d6/30aed7ef13ff3f359e99626c1b0a32ebbc3bf9b9d5616ec46e9e245d5fa9/jupyter_client-6.1.11-py3-none-any.whl (108kB)
     |████████████████████████████████| 112kB 56.4MB/s 
Requirement already satisfied, skipping upgrade: jupyter-core>=4.6.3; extra == "notebook" in /usr/local/lib/python3.7/dist-packages (from pandas-profiling[notebook]) (4.7.1)
Requirement already satisfied, skipping upgrade: numba>=0.38.1 in /usr/local/lib/python3.7/dist-packages (from phik>=0.10.0->pandas-profiling[notebook]) (0.51.2)
Requirement already satisfied, skipping upgrade: ipython>=4.0.0; python_version >= "3.3" in /usr/local/lib/python3.7/dist-packages (from ipywidgets>=7.5.1->pandas-profiling[notebook]) (5.5.0)
Requirement already satisfied, skipping upgrade: traitlets>=4.3.1 in /usr/local/lib/python3.7/dist-packages (from ipywidgets>=7.5.1->pandas-profiling[notebook]) (4.3.3)
Requirement already satisfied, skipping upgrade: ipykernel>=4.5.1 in /usr/local/lib/python3.7/dist-packages (from ipywidgets>=7.5.1->pandas-profiling[notebook]) (4.10.1)
Requirement already satisfied, skipping upgrade: widgetsnbextension~=3.5.0 in /usr/local/lib/python3.7/dist-packages (from ipywidgets>=7.5.1->pandas-profiling[notebook]) (3.5.1)
Requirement already satisfied, skipping upgrade: nbformat>=4.2.0 in /usr/local/lib/python3.7/dist-packages (from ipywidgets>=7.5.1->pandas-profiling[notebook]) (5.1.2)
Requirement already satisfied, skipping upgrade: jupyterlab-widgets>=1.0.0; python_version >= "3.6" in /usr/local/lib/python3.7/dist-packages (from ipywidgets>=7.5.1->pandas-profiling[notebook]) (1.0.0)
Requirement already satisfied, skipping upgrade: chardet<5,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests>=2.24.0->pandas-profiling[notebook]) (3.0.4)
Requirement already satisfied, skipping upgrade: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests>=2.24.0->pandas-profiling[notebook]) (2020.12.5)
Requirement already satisfied, skipping upgrade: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests>=2.24.0->pandas-profiling[notebook]) (1.24.3)
Requirement already satisfied, skipping upgrade: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests>=2.24.0->pandas-profiling[notebook]) (2.10)
Requirement already satisfied, skipping upgrade: pyyaml in /usr/local/lib/python3.7/dist-packages (from confuse>=1.0.0->pandas-profiling[notebook]) (3.13)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.2.0->pandas-profiling[notebook]) (2.8.1)
Requirement already satisfied, skipping upgrade: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.2.0->pandas-profiling[notebook]) (0.10.0)
Requirement already satisfied, skipping upgrade: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.2.0->pandas-profiling[notebook]) (1.3.1)
Requirement already satisfied, skipping upgrade: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.2.0->pandas-profiling[notebook]) (2.4.7)
Requirement already satisfied, skipping upgrade: MarkupSafe>=0.23 in /usr/local/lib/python3.7/dist-packages (from jinja2>=2.11.1->pandas-profiling[notebook]) (1.1.1)
Requirement already satisfied, skipping upgrade: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas!=1.0.0,!=1.0.1,!=1.0.2,!=1.1.0,>=0.25.3->pandas-profiling[notebook]) (2018.9)
Requirement already satisfied, skipping upgrade: networkx>=2.4 in /usr/local/lib/python3.7/dist-packages (from visions[type_image_path]==0.6.0->pandas-profiling[notebook]) (2.5)
Collecting imagehash; extra == "type_image_path"
  Downloading https://files.pythonhosted.org/packages/8e/18/9dbb772b5ef73a3069c66bb5bf29b9fb4dd57af0d5790c781c3f559bcca6/ImageHash-4.2.0-py2.py3-none-any.whl (295kB)
     |████████████████████████████████| 296kB 51.7MB/s 
Requirement already satisfied, skipping upgrade: Pillow; extra == "type_image_path" in /usr/local/lib/python3.7/dist-packages (from visions[type_image_path]==0.6.0->pandas-profiling[notebook]) (7.0.0)
Requirement already satisfied, skipping upgrade: tornado>=4.1 in /usr/local/lib/python3.7/dist-packages (from jupyter-client>=6.0.0; extra == "notebook"->pandas-profiling[notebook]) (5.1.1)
Requirement already satisfied, skipping upgrade: pyzmq>=13 in /usr/local/lib/python3.7/dist-packages (from jupyter-client>=6.0.0; extra == "notebook"->pandas-profiling[notebook]) (22.0.3)
Requirement already satisfied, skipping upgrade: setuptools in /usr/local/lib/python3.7/dist-packages (from numba>=0.38.1->phik>=0.10.0->pandas-profiling[notebook]) (53.0.0)
Requirement already satisfied, skipping upgrade: llvmlite<0.35,>=0.34.0.dev0 in /usr/local/lib/python3.7/dist-packages (from numba>=0.38.1->phik>=0.10.0->pandas-profiling[notebook]) (0.34.0)
Requirement already satisfied, skipping upgrade: pexpect; sys_platform != "win32" in /usr/local/lib/python3.7/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (4.8.0)
Requirement already satisfied, skipping upgrade: simplegeneric>0.8 in /usr/local/lib/python3.7/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.8.1)
Requirement already satisfied, skipping upgrade: pickleshare in /usr/local/lib/python3.7/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.7.5)
Requirement already satisfied, skipping upgrade: prompt-toolkit<2.0.0,>=1.0.4 in /usr/local/lib/python3.7/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (1.0.18)
Requirement already satisfied, skipping upgrade: decorator in /usr/local/lib/python3.7/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (4.4.2)
Requirement already satisfied, skipping upgrade: pygments in /usr/local/lib/python3.7/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (2.6.1)
Requirement already satisfied, skipping upgrade: six in /usr/local/lib/python3.7/dist-packages (from traitlets>=4.3.1->ipywidgets>=7.5.1->pandas-profiling[notebook]) (1.15.0)
Requirement already satisfied, skipping upgrade: ipython-genutils in /usr/local/lib/python3.7/dist-packages (from traitlets>=4.3.1->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.2.0)
Requirement already satisfied, skipping upgrade: notebook>=4.4.1 in /usr/local/lib/python3.7/dist-packages (from widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (5.3.1)
Requirement already satisfied, skipping upgrade: jsonschema!=2.5.0,>=2.4 in /usr/local/lib/python3.7/dist-packages (from nbformat>=4.2.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (2.6.0)
Requirement already satisfied, skipping upgrade: PyWavelets in /usr/local/lib/python3.7/dist-packages (from imagehash; extra == "type_image_path"->visions[type_image_path]==0.6.0->pandas-profiling[notebook]) (1.1.1)
Requirement already satisfied, skipping upgrade: ptyprocess>=0.5 in /usr/local/lib/python3.7/dist-packages (from pexpect; sys_platform != "win32"->ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.7.0)
Requirement already satisfied, skipping upgrade: wcwidth in /usr/local/lib/python3.7/dist-packages (from prompt-toolkit<2.0.0,>=1.0.4->ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.2.5)
Requirement already satisfied, skipping upgrade: Send2Trash in /usr/local/lib/python3.7/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (1.5.0)
Requirement already satisfied, skipping upgrade: terminado>=0.8.1 in /usr/local/lib/python3.7/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.9.2)
Requirement already satisfied, skipping upgrade: nbconvert in /usr/local/lib/python3.7/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (5.6.1)
Requirement already satisfied, skipping upgrade: mistune<2,>=0.8.1 in /usr/local/lib/python3.7/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.8.4)
Requirement already satisfied, skipping upgrade: defusedxml in /usr/local/lib/python3.7/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.6.0)
Requirement already satisfied, skipping upgrade: entrypoints>=0.2.2 in /usr/local/lib/python3.7/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.3)
Requirement already satisfied, skipping upgrade: bleach in /usr/local/lib/python3.7/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (3.3.0)
Requirement already satisfied, skipping upgrade: pandocfilters>=1.4.1 in /usr/local/lib/python3.7/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (1.4.3)
Requirement already satisfied, skipping upgrade: testpath in /usr/local/lib/python3.7/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.4.4)
Requirement already satisfied, skipping upgrade: webencodings in /usr/local/lib/python3.7/dist-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.5.1)
Requirement already satisfied, skipping upgrade: packaging in /usr/local/lib/python3.7/dist-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (20.9)
Building wheels for collected packages: phik, htmlmin
  Building wheel for phik (setup.py) ... done
  Created wheel for phik: filename=phik-0.11.0-cp37-none-any.whl size=599736 sha256=871514972f68612cba0f76c994bcf24eb1497bc0ebf059015a90bfbf7d83e4f4
  Stored in directory: /root/.cache/pip/wheels/af/54/11/aba77f21075918de02f7964eabfe8c10d5542df9e6ad10b225
  Building wheel for htmlmin (setup.py) ... done
  Created wheel for htmlmin: filename=htmlmin-0.1.12-cp37-none-any.whl size=27085 sha256=b6d7224249ecf9c457b44708b117f1d3e2f0abd01a136b5c2a943b9cdc45db81
  Stored in directory: /root/.cache/pip/wheels/43/07/ac/7c5a9d708d65247ac1f94066cf1db075540b85716c30255459
Successfully built phik htmlmin
ERROR: google-colab 1.0.0 has requirement requests~=2.23.0, but you'll have requests 2.25.1 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.
Installing collected packages: phik, requests, tqdm, htmlmin, confuse, tangled-up-in-unicode, imagehash, visions, jupyter-client, pandas-profiling
  Found existing installation: requests 2.23.0
    Uninstalling requests-2.23.0:
      Successfully uninstalled requests-2.23.0
  Found existing installation: tqdm 4.41.1
    Uninstalling tqdm-4.41.1:
      Successfully uninstalled tqdm-4.41.1
  Found existing installation: jupyter-client 5.3.5
    Uninstalling jupyter-client-5.3.5:
      Successfully uninstalled jupyter-client-5.3.5
  Found existing installation: pandas-profiling 1.4.1
    Uninstalling pandas-profiling-1.4.1:
      Successfully uninstalled pandas-profiling-1.4.1
Successfully installed confuse-1.4.0 htmlmin-0.1.12 imagehash-4.2.0 jupyter-client-6.1.11 pandas-profiling-2.11.0 phik-0.11.0 requests-2.25.1 tangled-up-in-unicode-0.0.6 tqdm-4.57.0 visions-0.6.0
Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK

from ipywidgets import widgets

# Our package
from pandas_profiling import ProfileReport
from pandas_profiling.utils.cache import cache_file
profile = ProfileReport(df, title="red wine", html={"style": {"full_width": True}}, sort="None")

Takes a couple minutes to process and display the results even with a small dataset.

You do get some richer analysis like Correlation plots and distributions of the variables.

profile.to_widgets()
/usr/local/lib/python3.7/dist-packages/pandas_profiling/profile_report.py:424: UserWarning: Ipywidgets is not yet fully supported on Google Colab (https://github.com/googlecolab/colabtools/issues/60).As an alternative, you can use the HTML report. See the documentation for more information.
  "Ipywidgets is not yet fully supported on Google Colab (https://github.com/googlecolab/colabtools/issues/60)."
profile.to_notebook_iframe()

-

An alternative is Sweetviz. I tend to like this a bit better for its display of distributions. In general, it also loads a bit quicker.

!pip -q install sweetviz
     |████████████████████████████████| 15.1MB 325kB/s 
import sweetviz as sv
sweet_report = sv.analyze(df)
sweet_report.show_notebook(w=1200.)

My tool for numerical transformations

I wanted a simple way to view the distributions of the features and more importantly a way to view the data after numerical transformations such as Box-Cox or a log transform.

The following plot is a sample of what I developed. The first plot is the input and the next plots show the various transformations along with their skew and kurtosis. What is highlighted in pink is the transform that was automatically selected to yield the most gausian distribution.

Numerical transfors

See this

Post for more detail.